feat(query): RowBinaryWithNamesAndTypes for enhanced type safety #221
Conversation
Added a few comments regarding the intermediate implementation.
rowbinary/src/decoders.rs
```rust
    }
    let result = String::from_utf8_lossy(&buffer.copy_to_bytes(length)).to_string();
    Ok(result)
}
```
More or less the same as the implementation in the deserializer. Perhaps as a follow-up, all the reader logic can be extracted into similar functions with `#[inline(always)]`?
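A minimal sketch of that follow-up idea, assuming hypothetical helper names over a generic `bytes::Buf` reader (error handling elided; the crate's actual signatures may differ):

```rust
use bytes::Buf;

// Hypothetical extracted length decoder (stand-in for the crate's LEB128 reader).
#[inline(always)]
fn read_length(buffer: &mut impl Buf) -> usize {
    let (mut value, mut shift) = (0usize, 0u32);
    loop {
        let byte = buffer.get_u8();
        value |= usize::from(byte & 0x7f) << shift;
        if byte & 0x80 == 0 {
            return value;
        }
        shift += 7;
    }
}

// Shared string reader, mirroring the lossy decoding in the snippet above.
#[inline(always)]
fn read_string(buffer: &mut impl Buf) -> String {
    let length = read_length(buffer);
    String::from_utf8_lossy(&buffer.copy_to_bytes(length)).to_string()
}
```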
rowbinary/src/error.rs
```rust
    #[error("Type parsing error: {0}")]
    TypeParsingError(String),
}
```
Needs revising.
```rust
0 => visitor.visit_some(&mut RowBinaryDeserializer {
    input: self.input,
    validator: inner_data_type_validator,
}),
1 => visitor.visit_none(),
```
This is currently the main drawback of the validation implementation if we want to disable it after the first N rows for better performance: if those first rows are all NULLs, then we never properly validate the inner type.
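To illustrate the pitfall (a hypothetical schema and struct, not a test from this PR): RowBinary prefixes each `Nullable` value with a flag byte, as the arms above show, and a NULL row carries no inner-value bytes at all, so the inner type is never exercised while rows are NULL.

```rust
use clickhouse::Row;
use serde::Deserialize;

// Hypothetical table: CREATE TABLE t (x Nullable(UInt32)), first rows all NULL.
#[derive(Row, Deserialize)]
struct MyRow {
    // Wrong width (i64 vs UInt32), but while `x` is NULL only the flag
    // byte (1) is read, so First(N) validation never sees the mismatch.
    x: Option<i64>,
}
```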
```rust
        return Ok(());
    }
    Ok(_) => {
        // TODO: or panic instead?
```
Returning an error when we're already returning a `Result` seems correct.
It does not make sense to handle this error and continue if we cannot even parse the column header with names and types. That means everything has gone entirely wrong and should be unreachable... unless there are some odd network/LB issues?
still seems wrong to panic over bytes controlled by another actor
```rust
        assert_eq!(actual, sample());
    }
}

// #[test]
```
intend to restore this?
Will restore, but not sure if we need this test with so many integration tests that do essentially the same thing.
```rust
shift += 7;
if shift > 57 {
    // TODO: what about another error?
```
what's the rationale behind 57?
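For context, a sketch of the assumed loop shape around this guard (not the crate's exact code): shifts advance by 7 per continuation byte, so the check fires on the 10th byte.

```rust
use bytes::Buf;

// Unsigned LEB128 decoding with the `shift > 57` guard. Payload shifts are
// 0, 7, ..., 56 (9 bytes = 63 bits); after `shift += 7` the guard triggers
// at shift == 63, rejecting a 10th byte. A full u64 may need 10 bytes, so
// this bound trades the top bit for a simpler overflow check -- presumably
// the rationale behind 57, though a named constant would document it.
fn get_leb128(buf: &mut impl Buf) -> Result<u64, String> {
    let (mut value, mut shift) = (0u64, 0u32);
    loop {
        let byte = buf.get_u8();
        value |= u64::from(byte & 0x7f) << shift;
        if byte & 0x80 == 0 {
            return Ok(value);
        }
        shift += 7;
        if shift > 57 {
            return Err("LEB128 value is too long".to_string());
        }
    }
}
```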
.expect("failed to fetch string"); | ||
assert_eq!(result, "\x01\x02\x03\\ \"\'"); | ||
// | ||
// let result = client |
intended to restore?
Thanks for noticing. Don't know why it was commented out.
Pull Request Overview
This PR introduces support for the RowBinaryWithNamesAndTypes (RBWNAT) format for enhanced type safety in query deserialization, along with new validation modes and improvements in error messages and benchmarks. Key changes include:
- Adding new macros and tests to assert panic conditions on schema mismatches.
- Refactoring query execution to use RBWNAT and propagating a client-wide validation mode.
- Enhancements to serialization/deserialization, including a new utility for LEB128 encoding (sketched below) and improved columns header parsing.
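A minimal sketch of what such a LEB128 encoding utility looks like (hypothetical body; the PR's actual `put_leb128` may differ):

```rust
use bytes::BufMut;

// Encode `value` as an unsigned LEB128 varint: 7 payload bits per byte,
// high bit set while more bytes follow.
fn put_leb128(buf: &mut impl BufMut, mut value: u64) {
    loop {
        let mut byte = (value & 0x7f) as u8;
        value >>= 7;
        if value != 0 {
            byte |= 0x80; // continuation bit
        }
        buf.put_u8(byte);
        if value == 0 {
            break;
        }
    }
}
```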
Reviewed Changes
Copilot reviewed 32 out of 32 changed files in this pull request and generated 1 comment.
File | Description |
---|---|
tests/it/main.rs | Added macros to assert panics during fetch operations. |
tests/it/insert.rs | Updated table creation and row type in the rename insert test. |
tests/it/cursor_stats.rs | Adjusted expected decoded byte count to account for the RBWNAT header. |
tests/it/cursor_error.rs | Revised error handling and test scenarios for query timeouts. |
src/validation_mode.rs | Introduced new validation mode enum with documentation. |
src/rowbinary/ser.rs | Switched to using put_leb128 and replaced an error return with a panic. |
src/cursors/row.rs | Implemented async header reading and conditionally validated rows. |
examples/mock.rs | Updated the mock provide handler to include schema information. |
benches/select_*.rs | Updated benchmarks to pass the validation mode to the client. |
Cargo.toml | Updated bench configuration and dependency versions. |
Comments suppressed due to low confidence (1)
src/rowbinary/tests.rs:117
- The test 'it_deserializes' is commented out, which may reduce test coverage for deserialization; please re-enable or provide context for the deactivation.
// #[test]
// fn it_deserializes() { ... }
```diff
-return Err(Error::VariantDiscriminatorIsOutOfBound(
-    variant_index as usize,
-));
+panic!("max number of types in the Variant data type is 255, got {variant_index}")
```
Instead of panicking when the variant index exceeds 255, consider returning a proper error to allow for graceful error handling.
It does not make sense to handle this error; it means the entire deserialization went wrong, because the server ensures at most 255 variants per type.
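For context, the 255 ceiling follows from the wire format as I read it (a sketch, not the crate's exact code): the Variant discriminator is a single byte, with one value assumed reserved for NULL.

```rust
use bytes::Buf;

// One byte can distinguish 256 states; assuming 255 is reserved as the
// NULL marker, at most 255 distinct variant types remain addressable.
fn read_variant_discriminator(buf: &mut impl Buf) -> Option<u8> {
    const NULL_DISCRIMINATOR: u8 = 255; // assumed reserved value
    match buf.get_u8() {
        NULL_DISCRIMINATOR => None,
        index => Some(index),
    }
}
```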
Some comments from Claude Code (could be complete slop, I know little about Rust):

General

1. Panic-Driven Error Handling is Unacceptable

Recommendation: Convert all panics into proper errors.

2. API Breaking Changes Without Semver
Performance optimizations

🔥 High-Impact Optimizations

1. Eliminate Allocations in Error Paths

Location:

```rust
// Current: Allocates strings in panic paths
format!("{}.{}", self.get_struct_name(), c.name)
"Struct".to_string()

// Better: Use Cow<str> for zero-allocation error messages
use std::borrow::Cow;

fn get_current_column_name(&self) -> Cow<'static, str> {
    // avoid format! allocation
}
```

2. Optimize String Deserialization

Location:

```rust
// Current: Always allocates Vec for String
fn read_vec(&mut self, size: usize) -> Result<Vec<u8>> {
    Ok(self.read_slice(size)?.to_vec()) // ❌ Always allocates
}

// Better: Only allocate when necessary
fn deserialize_string<V: Visitor<'data>>(self, visitor: V) -> Result<V::Value> {
    let slice = self.read_slice(size)?;
    match str::from_utf8(slice) {
        Ok(s) => visitor.visit_borrowed_str(s), // Zero-copy!
        Err(_) => {
            let string = String::from_utf8_lossy(slice).into_owned();
            visitor.visit_string(string) // Only allocate for invalid UTF-8
        }
    }
}
```

3. Branch Prediction Optimization in Validation

Location:

```rust
// Current: Pattern matching on validation count
let (result, not_enough_data) = match self.rows_to_validate {
    0 => rowbinary::deserialize_from::<T>(&mut slice, &[]),
    u64::MAX => rowbinary::deserialize_from::<T>(&mut slice, &self.columns),
    _ => { /* ... */ }
};

// Better: Likely/unlikely hints for branch predictor
let (result, not_enough_data) = if likely(self.rows_to_validate > 0) {
    if self.rows_to_validate == u64::MAX {
        rowbinary::deserialize_from::<T>(&mut slice, &self.columns)
    } else {
        self.rows_to_validate -= 1;
        rowbinary::deserialize_from::<T>(&mut slice, &self.columns)
    }
} else {
    rowbinary::deserialize_from::<T>(&mut slice, &[])
};
```

🎯 Medium-Impact Optimizations

4. Validation State Caching

Location:

```rust
// Current: Validates every field access
self.validator.validate(serde_type)?;

// Better: Cache validation results for repeated patterns
struct CachedValidator {
    last_column_idx: usize,
    last_validation: Option<ValidatedState>,
}
```

5. SIMD-Optimized Size Checks

Location:

```rust
// Current: Individual size checks
ensure_size(&mut self.input, core::mem::size_of::<$ty>())?;

// Better: Batch size checks for multiple fields
fn ensure_sizes_batch(input: &[u8], sizes: &[usize]) -> Result<()> {
    // SIMD-optimized batch boundary checking
}
```

6. Avoid Repeated Column Lookups

Location:

```rust
// Current: String formatting on every error
format!("{}.{}", self.get_struct_name(), c.name)

// Better: Pre-format common error prefixes
struct ErrorContext {
    column_prefix: String, // Pre-computed once per struct
}
```
@ilidemi, thanks. Here are some comments on that:
It is more than acceptable here. It panics in case of an invalid struct definition in the code; there is no reason to continue (de)serializing junk. It is unsafe, and the Rust guidelines explicitly say the following about trying to continue with incorrect values:
I get that we have exactly this situation. See the explanation about the Result:
It is not possible to recover a program that uses the crate from an incorrect struct definition. It must be fixed by the user.
Well, RBWNAT instead of RB is the intention. It will also go as 0.14.0, where breaking changes are actually allowed. RowCursor is private to the end user; its visibility is
Worth looking into.
Hints are unstable, and we cannot use them. But RowCursor became ~20% slower, that is a fact, and ideally we need to find a way to reduce the overhead, but I couldn't yet.
It already validates only one array value and one key-value pair of a Map.
Worth checking indeed.
There are no errors, only panics, so it does not really matter IMO.
25% hit rate, definitely space for improvement 😊 Got a few more from Opus 4:
Summary
Warning
This is a work-in-progress implementation and may change significantly.
It implements RBWNAT for Query only; Insert should be a new PR.
First of all, let's abbreviate the RowBinaryWithNamesAndTypes format as RBWNAT, and the regular RowBinary as just RB, for simplicity.
There is a significant number of issues in the repository regarding schema incompatibility or obscure error messages (see the full list below). The reason is that deserialization is effectively implemented in a "data-driven" way, where the user's structures dictate how the RB stream is (de)serialized, so it is possible to have a hiccup where two UInt32 values are deserialized as a single UInt64, which in the worst-case scenario may lead to corrupted data. For example:
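The test itself is elided here; a reconstruction of its shape (hypothetical table and struct names, for illustration only) looks like this:

```rust
use clickhouse::Row;
use serde::Deserialize;

// Server-side column: `dt DateTime64(3)`, streamed as 8 bytes (Int64).
// This struct wrongly consumes those same 8 bytes as two 4-byte integers;
// plain RB cannot notice, while RBWNAT validation can.
#[derive(Row, Deserialize)]
struct MismatchedRow {
    a: u32, // silently reads the low 4 bytes of the DateTime64
    b: u32, // silently reads the high 4 bytes
}
```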
This test deserializes a wrong value on the main branch, because DateTime64 is streamed as 8 bytes (Int64), and 2x (U)Int32 are also streamed as 8 bytes in total. On this branch, with validation mode enabled, it now correctly throws an error.
This PR introduces:
- `StructValidationMode` client option, which has two possible modes (see the usage sketch below):
  - `First(1)` (default) - uses RBWNAT and checks the types for the first row only, so it retains most of the performance compared to the Disabled mode, while still providing significantly stronger guarantees.
  - `EachRow` - uses RBWNAT, and every single row is validated. It is expected to be significantly slower than the default mode.
- `types` internal crate that contains utils to deal with RBWNAT and parsing Native data type strings into a proper AST. Rustified (though not entirely) from https://github.com/ClickHouse/clickhouse-js/blob/main/packages/client-common/src/parse/column_types.ts. The most important part is the correctness and the tests; the actual implementation details can be adjusted in a follow-up.
- Maps can now be deserialized as `HashMap<K, V>`, and not only as `Vec<(K, V)>`, which was confusing.
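A sketch of how the new option might be wired up (the builder method name and enum path are assumptions based on this description, not a confirmed API):

```rust
use clickhouse::Client;

fn main() {
    // Hypothetical: `with_validation_mode` and `StructValidationMode` are
    // assumptions; `Client::default()` and `with_url` are existing API.
    let client = Client::default()
        .with_url("http://localhost:8123")
        .with_validation_mode(clickhouse::StructValidationMode::EachRow);
    let _ = client;
}
```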
Likely possible to implement:

- `visit_map` API allowed for `deserialize_struct` instead of the current `visit_seq`, which processes a struct as a tuple.

Source files to look at:
Current benchmark results
Select numbers
This branch:
Main branch:
Still losing a bit when validating only the first record. Each-row validation mode, as expected, is significantly slower. But NYC taxi data (a more real-world scenario, because no one streams `system.numbers`, I guess..) shows totally different and very promising results.

NYC taxi data
This branch:
Main branch:
The difference is, in fact, not that great, especially considering the benefits the `EachRow` validation mode provides. Perhaps it is not a bad idea to use `EachRow` as the default mode instead of `First(1)`?

Issues overview
Note
If an issue is checked in the list, that means there is also a test that demonstrates proper error messages in case of schema mismatch.
Resolved issues
- `FixedString` type #49
- `Query::fetch_one()` and `Query::fetch_optional()` can still finish successfully even in the case of schema mismatch #187
- `Custom("premature end of input")` on `max(DateTime64)` query on empty table #218 - clearer error messages

Related issues
Previously closed issues with unclear error messages
- `CANNOT_READ_ALL_DATA` Error when serializing Nested types with Map fields #214
- `InvalidUtf8Encoding(Utf8Error { valid_up_to: 84, error_len: Some(1) })` error #173
- `u8`, `i8`, and `bool` #100 - a similar issue to 2x UInt32 decoded into 1x Int64

Follow-up issues
, after callingend()
#59